nnLB: Next Pitch

A Novel Deep Learning Approach to MLB Pitch Prediction Using In-Game Video Footage


Abstract

The importance of analytics in baseball has grown considerably in recent decades. Accordingly, Major League Baseball (MLB) organizations have invested substantial resources into the research and development of advanced statistical methods that can be leveraged to gain competitive advantages. Pitch prediction has emerged as one of these active areas of research. Here we develop a novel deep learning approach for pitch prediction. Using pose estimation time-series data from in-game video footage, we train two pitcher-specific convolutional neural networks (CNNs) to predict the pitches of Tyler Glasnow (2019 season) and Walker Buehler (2021 season). Notably, our selected model achieves a prediction accuracy of 87.1% and an area under the curve (AUC) of 0.919 on a holdout test set for Tyler Glasnow’s 2019 season. These results demonstrate the effectiveness of using in-game video footage and deep learning for pitch prediction tasks.

Introduction

Organizations across all major sports leagues have adopted data-driven decision-making approaches to remain competitive in recent decades. Among these leagues, Major League Baseball is widely recognized as the pioneer in embracing analytics. In fact, an entire domain of sports-based analytics, termed sabermetrics, is devoted to baseball-specific statistics and analysis. Consequently, a wealth of high-resolution public data and untapped opportunities exist within the world of baseball.

Baseball enthusiasts would agree that success within the sport relies heavily on the game within the game. Identifying and exploiting small advantages can yield significant returns in achieving desired outcomes. Here, we introduce a deep learning method that utilizes in-game video footage to predict pitches. This endeavor is motivated by two factors. First, a reliable pitch classifier can provide batters with an edge during live at-bats. Second, an interpretable deep learning model can give pitchers insight into how predictable they are and how they can conceal their pitches more effectively.

Methods

Defining the Sample

Prediction tasks must be segmented by pitcher since pitchers have unique motions, tendencies, and pitching arsenals (i.e., pitchers throw different types of pitches). As such, we decided to focus our proof of concept analysis on two pitchers: Tyler Glasnow (2019 regular season) and Walker Buehler (2021 regular season). Further, most pitchers pitch from two separate positions (the windup and the stretch), which is conditional on game situation. We decided to focus on pitches thrown from the stretch since the motion is more compact.

Data Collection

I. Web Scraping

BaseballSavant is a website dedicated to providing the public with access to historical MLB data. These data include video footage and Statcast tabular data for every pitch thrown in the MLB since 2018 and 2015, respectively. We built web scrapers to retrieve both the video source URL and pitch type for every pitch of interest. Video source URLs were used as inputs for our feature engineering pipeline.

II. Feature Engineering

A highlight of our work is the feature engineering process, termed the Video2Data pipeline. The pipeline works as follows. First, a video is downloaded from the source URL and converted to a series of images (or frames). Second, an object detection model determines the location of the pitcher in each frame. The coordinates reported by the model are then used to blur the background of each image; blurring is performed with OpenCV. The object detection model used in this step is a custom Detectron2 model (Faster R-CNN) trained on a self-annotated data set to specifically detect pitchers. This step is necessary for scalable and reliable feature extraction because the OpenPose pose estimation software (used in the following step) detects humans non-specifically. Third, OpenPose is applied to each image to extract the coordinates of 25 keypoints on the pitcher's body. Finally, keypoint coordinates from all frames are merged into a single data structure. Example outputs generated at each step of the Video2Data pipeline are shown in Figure 1.
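The background-blurring step can be sketched as follows. This is a minimal numpy stand-in for the OpenCV-based step described above; the `blur_background` function, its block-averaging blur, and the `(x1, y1, x2, y2)` box format are illustrative assumptions, not the pipeline's actual implementation.

```python
import numpy as np

def blur_background(frame, bbox, block=8):
    """Blur everything outside the detected pitcher bounding box.

    frame : (H, W, 3) uint8 image array
    bbox  : (x1, y1, x2, y2) pitcher box from the object detector
            (hypothetical format; actual Detectron2 outputs differ)
    block : tile size for a cheap box blur (each tile is replaced
            by its mean color)
    """
    h, w = frame.shape[:2]
    blurred = frame.astype(float)
    # crude box blur: replace each block-sized tile with its mean color
    for i in range(0, h, block):
        for j in range(0, w, block):
            tile = blurred[i:i + block, j:j + block]
            tile[:] = tile.mean(axis=(0, 1))
    # paste the sharp pitcher region back over the blurred frame
    x1, y1, x2, y2 = bbox
    blurred[y1:y2, x1:x2] = frame[y1:y2, x1:x2]
    return blurred.astype(np.uint8)
```

In the real pipeline a Gaussian blur (e.g., via OpenCV) would replace the block-mean filter, but the structure is the same: blur the whole frame, then restore the detector's pitcher region.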


Figure 1. Example Video2Data pipeline outputs. (1) Video to image conversion (left). (2) Pitcher detection and background blurring (middle). (3) OpenPose pose estimation (right).

III. Data Preprocessing

Videos accessible through BaseballSavant are inherently inconsistent (e.g., in video duration, camera perspective, etc.). Additionally, the Video2Data pipeline produced some undesired artifacts that could be harmful for machine learning applications (e.g., missingness and pose estimation errors). As such, we implemented a data cleaning process that both detects and removes unusable observations and prepares usable observations for modeling.

The first step of our data preprocessing method addresses the inconsistent video duration problem. To crop each video to a similar range, we use pose estimation data to identify the frame at which the pitcher’s knee reaches its \(y_{max}\), which we refer to as \(t_{y_{max}}\) (i.e., the frame at which the left knee’s \(y\)-coordinate is maximized if the pitcher is right-handed, and vice versa). We decided to use \(t_{y_{max}}\) as a reference time point since \(t_{y_{max}}\) reliably corresponds to a common event in a pitcher’s delivery (termed the peak leg lift), irrespective of who is pitching. We then use \(t_{y_{max}}\) to determine the start and end frames, termed \(t_{start}\) and \(t_{end}\), using Equations 1-3. Briefly, \(t_{start}\) and \(t_{end}\) are computed for each observation and are used to crop each observation’s OpenPose coordinate data to a common 15 time points. An example of the frame range included for our prediction task is shown in Figure 2. Observations with \(t_{start} < 0\) are considered to have an insufficient number of frames for inclusion and are removed from consideration.

\[
\begin{aligned}
n_{frames} &= 15 \\
t_{end} &= t_{y_{max}} + 2 \\
t_{start} &= t_{end} - (n_{frames} - 1)
\end{aligned}
\tag{Equations 1-3}
\]
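The cropping logic in Equations 1-3 can be expressed directly in code. A minimal sketch, assuming the knee's \(y\)-coordinate increases with leg height (OpenPose image coordinates actually increase downward, so a sign flip may be needed in practice):

```python
import numpy as np

N_FRAMES = 15

def crop_window(knee_y):
    """Locate the peak leg lift and return (t_start, t_end) per Equations 1-3.

    knee_y : 1-D array of the lead knee's y-coordinate per frame
             (left knee for right-handed pitchers, and vice versa).
    Returns None when t_start < 0, i.e. too few frames for inclusion.
    """
    t_peak = int(np.argmax(knee_y))      # t_{y_max}: peak leg lift frame
    t_end = t_peak + 2                   # Equation 2
    t_start = t_end - (N_FRAMES - 1)     # Equation 3
    if t_start < 0:
        return None                      # insufficient frames; drop observation
    return t_start, t_end
```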


Figure 2. An example of the cropped frame range used for pitch classification.

The second step of our data preprocessing method aims to further remove erroneous observations. Three selection criteria are applied, beyond the frame-count requirement established previously. First, we require the angle between the knees at \(t_{y_{max}}\) (the peak leg lift) to be between 40° and 90°. This criterion allows us to identify video footage taken from an atypical camera perspective, given that the standard camera perspective faces from center field toward home plate. For example, in no circumstance would the angle between the knees at \(t_{y_{max}}\) fall between 40° and 90° if the camera faces from third base to first base. Second, we require observations to have at least 80% of \(x\), \(y\) coordinate pairs available. Third, we perform a manual inspection using data visualizations to identify erroneous observations that were not caught by the aforementioned selection criteria.
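A sketch of the two automated selection criteria, under the assumption that the between-knee angle is measured at the mid-hip keypoint (the report does not specify the angle's vertex, so this is illustrative):

```python
import numpy as np

def knee_angle(mid_hip, left_knee, right_knee):
    """Angle (degrees) between the two hip-to-knee vectors.

    Using the mid-hip keypoint as the vertex is an assumption; the
    report only states the angle is measured 'between each knee'.
    """
    v1 = np.asarray(left_knee, dtype=float) - np.asarray(mid_hip, dtype=float)
    v2 = np.asarray(right_knee, dtype=float) - np.asarray(mid_hip, dtype=float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def passes_filters(obs, mid_hip, left_knee, right_knee):
    """Apply the two automated selection criteria to one observation.

    obs : (frames, keypoints, 2) array with NaNs marking missing coordinates.
    """
    angle_ok = 40.0 <= knee_angle(mid_hip, left_knee, right_knee) <= 90.0
    # fraction of (x, y) pairs where both coordinates are present
    frac_present = 1.0 - np.isnan(obs).any(axis=-1).mean()
    return angle_ok and frac_present >= 0.80
```

Observations failing either check would then go to the manual-inspection stage described above.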

The third and final step of our data preprocessing method involves preparing selected observations for modeling. First, we handle missing coordinate data using the time-series-aware pandas interpolate() function. Second, we select 10 keypoints (or body parts) to include in downstream applications: the neck; the left and right hip, knee, and ankle; and the wrist, elbow, and shoulder of the pitcher’s throwing arm (e.g., the right wrist, elbow, and shoulder for right-handed pitchers). Third, we normalize all \(x\) and \(y\) coordinates for a given observation to be between 0 and 1. The \(x\) coordinates for a given observation are normalized using \(x_{min}\) and \(x_{max}\) from the full set of \(x\) coordinates for that observation (i.e., the \(x\) values from all keypoints); the \(y\) coordinates are normalized analogously. The aim here is to address between-video variation in pitcher location and scale. This normalization step is not to be confused with the feature scaling step used during modeling. Fourth, and finally, we label each observation with its pitch type outcome. Many well-cited articles reduce the pitch prediction task to a binary classification problem of predicting fastball versus non-fastball. In this report, we instead group pitches into three classes based on Statcast labeling: fastball (FB), fastball with movement (FBwM), and off-speed (OFF). This grouping is presented in Table 1. Note that Tyler Glasnow’s pitch type set includes only fastballs and off-speed pitches; his pitch prediction task therefore reduces to a binary classification problem. The final feature set was three-dimensional per observation, with shape (15, 10, 2) (i.e., 15 time points, 10 body parts, and two coordinates (\(x\), \(y\))).

Table 1. Outcome class groupings.
Pitch Type Class   Class Make-Up
FB                 Fastball (FA), Four-Seam Fastball (FF)
FBwM               Two-Seam Fastball/Sinker (FT/SI), Cutter (FC), Splitter (FS/SF)
OFF                Changeup (CH), Slider (SL), Curveball (CB/CU), Knuckle-Curve (KC), Knuckleball (KN)
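The per-observation min-max normalization described in the third preprocessing step might look like the following numpy sketch (in the actual pipeline, missing values are interpolated with pandas before this step):

```python
import numpy as np

def normalize_observation(obs):
    """Min-max normalize one observation's coordinates to [0, 1].

    obs : (15, 10, 2) array of (x, y) keypoint coordinates.
    x and y are scaled independently, each using the min/max taken over
    all frames and keypoints of this observation, which removes
    between-video differences in pitcher location and scale.
    """
    out = obs.astype(float).copy()
    for c in range(2):  # channel 0: x, channel 1: y
        lo, hi = out[..., c].min(), out[..., c].max()
        out[..., c] = (out[..., c] - lo) / (hi - lo)
    return out
```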

Model Development

Model Specifications

We designed four candidate convolutional neural network (CNN) architectures for our prediction task. Architecture specifics are described in the section below; constant model hyperparameters and training methods are specified here. All models utilize the Adam optimizer with a starting learning rate of \(1 \times 10^{-4}\) and a batch size of 16. Learning rates were reduced by a factor of \(0.5 \times 10^{-2}\) during training if the validation loss did not improve for 10 epochs. The same dropout rates and regularization penalties were used for each model: a dropout rate of 0.3 in dropout layers, and L1 and L2 regularization penalties (\(1 \times 10^{-2}\) and \(1 \times 10^{-4}\), respectively) on the kernel of each model’s last three (non-output) dense layers. The maximum number of epochs was set to 200; training was stopped early, however, if validation loss did not decrease over the course of 50 epochs. Binary cross-entropy and categorical cross-entropy were used as the loss functions for binary and multiclass classification models, respectively. Loss functions were weighted according to outcome class weights to account for class imbalance. Binary and multiclass classification models use sigmoid and softmax activation functions in the output layer, respectively. All models were implemented using Keras (version 2.12.0) in Python (version 3.10.11).
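The class-weighted binary cross-entropy can be illustrated in plain numpy. This mirrors in spirit how Keras applies per-class weights to the loss during training; the weight values used below are illustrative only, not the ones fit to our data:

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, class_weight):
    """Class-weighted binary cross-entropy.

    y_true       : 0/1 labels
    y_pred       : predicted probabilities for class 1
    class_weight : {0: w0, 1: w1} mapping, weighting each sample's loss
                   by the weight of its true class
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1 - 1e-7)
    w = np.where(y_true == 1, class_weight[1], class_weight[0])
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(w * losses))
```

Setting the minority class's weight above 1 increases its contribution to the loss, which is the mechanism used to counteract class imbalance.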

Model Evaluation

Prior to modeling, we split the data set into a cross-validation (CV) set (80% of the data) and holdout test set (20% of the data). CV and test set sample sizes are shown for each pitcher in Table 2. Data set splits were stratified by the outcome to ensure balanced class representation across each set. We then used stratified \(k\)-fold CV with five folds to evaluate each candidate architecture. During 5-fold CV, the CV data set is split into 5 folds. Five separate models (per candidate model) are then trained using \(k-1\) (four) folds as the training data set and one fold as the validation data set. The original holdout test set is used to evaluate each of the five models. Features were standardized by removing the mean and scaling to unit variance. The mean and variance values used for standardization were learned from the training set and applied to the validation set and holdout test set during each round of 5-fold CV.
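The leakage-free standardization described above can be sketched as follows (equivalent in spirit to scikit-learn's StandardScaler fit on the training folds only, then applied unchanged to validation and test data):

```python
import numpy as np

def fit_standardizer(train):
    """Learn per-feature mean and standard deviation from training data only."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant features
    return mu, sd

def apply_standardizer(data, mu, sd):
    """Apply training-set statistics to any split (no refitting)."""
    return (data - mu) / sd
```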

Table 2. Cross-validation set and holdout test set sample sizes.
                Tyler Glasnow                      Walker Buehler
Pitch Type   CV Set   Test Set   Class % (Test)   CV Set   Test Set   Class % (Test)
FB           373      93         64.6%            297      75         41.7%
FBwM         -        -          -                170      42         23.3%
OFF          202      51         35.4%            249      63         35.0%
Total        575      144                         716      180

Prediction accuracy, area under the receiver operating characteristic curve (ROC-AUC), area under the precision-recall curve (AUPRC), Matthew’s correlation coefficient (MCC), and weighted F1-score were used to evaluate the predictive performance of each model. We report the mean of each performance metric (from the aforementioned 5-fold CV, using the original holdout test set for evaluation) for each model architecture. Additionally, we use one-vs-rest (OvR) ROC-AUC and AUPRC for multiclass classification evaluation since ROC-AUC and AUPRC are traditionally used for binary classification. The one-vs-rest approach binarizes the predictions into reference and non-reference classes to compute metrics (in this case ROC-AUC and AUPRC) for each class. MCC serves as a useful all-purpose metric for between-pitcher comparisons since it can be used to evaluate both binary and multiclass prediction performance.
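For binary classification, MCC can be computed directly from the confusion counts. A minimal sketch:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns a value in [-1, 1]: 1 for perfect prediction, 0 for
    chance-level prediction. Returns 0 when any marginal is empty.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```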

Candidate Model Architectures

I. CNN-2D

The CNN-2D model was originally designed as a baseline model that the more complex model architectures could be compared to. We term this model architecture CNN-2D since we reshape the 3-dimensional (3D) input shape of (15, 10, 2) to a 2-dimensional (2D) shape (10, 30) (i.e., the time steps and \(x\), \(y\) coordinates are collapsed). We additionally tried using a 2D input shape of (15, 20), whereby the body part dimension and \(x\), \(y\) coordinates are collapsed, but ultimately chose the (10, 30) configuration due to improved performance. The CNN-2D architecture consists of four convolutional layers that are connected to a series of six dense layers. Convolutional layers consist of 2D convolutions, using kernels of size (3, 1), followed by batch normalization and rectified linear unit (ReLU) activation functions. We found that (3, 1) kernels outperformed (3, 3) kernels in terms of ROC-AUC and accuracy. An increasing number of filters (8, 16, 32, and 64 filters) were used for consecutive convolutional layers. The output of the convolutional block is connected to a series of six fully connected dense layers. The first five dense layers incorporate 256 (with dropout), 128 (with dropout), 64 (with regularization), 32 (with regularization), and 16 (with regularization) nodes and ReLU activations. The final dense layer (the output layer) consists of one node when the sigmoid activation function is used and three nodes when the softmax activation function is used.
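The (15, 10, 2) to (10, 30) reshaping can be sketched as below. The ordering of the 30 collapsed features (time-major, with x preceding y per frame) is an assumption; the report states only that the time and coordinate axes are collapsed:

```python
import numpy as np

def to_cnn2d_input(batch):
    """Collapse the time and coordinate axes for the CNN-2D model.

    batch : (n, 15, 10, 2) array -> (n, 10, 30) array, where each of
    the 10 body-part rows holds its 15 frames of (x, y) pairs.
    """
    n = batch.shape[0]
    # move body parts to the front, then flatten 15 frames x 2 coords
    return batch.transpose(0, 2, 1, 3).reshape(n, 10, 30)
```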

II. CNN-3D

The CNN-3D model architecture strongly resembles the CNN-2D model architecture, although the 3D input shape is preserved (hence CNN-3D). Distinct from the CNN-2D model architecture, convolutional layers use 3D convolutions with kernels of size (3, 3, 1). We were restricted to using this kernel size due to the size of the 3D input. All other model architecture details are consistent with the CNN-2D model architecture.

III. CNN-3D-LSTM

The CNN-3D-LSTM model architecture builds off of the CNN-3D model architecture by incorporating a 256-unit long short-term memory (LSTM) layer between the convolutional block and dense layers. All other model architecture details are identical to the CNN-3D model architecture.

IV. Branched CNN

The Branched CNN model architecture was inspired by an article that uses a convolutional neural network to predict mortality from 12-lead electrocardiogram voltage data [8]. Briefly, the Branched CNN model architecture consists of 10 separate convolutional branches that independently extract features from each keypoint (or body part). The features extracted by each branch are concatenated and fed to a series of fully connected dense layers identical to those of the other model architectures. Similar to the convolutional block used in the CNN-2D architecture, each branch consists of four convolutional layers that use 2D convolutions and kernels of size (3, 1). However, ReLU activations are (unconventionally) applied directly after convolutions (and before batch normalization); we found that the Conv2D-BatchNormalization-ReLU configuration used in the other model architectures led to inconsistent and diminished performance here. Additionally, global average pooling is independently applied to the output of each convolutional branch prior to concatenation.

Results

Tyler Glasnow

Binary pitch classification results from Tyler Glasnow’s 2019 regular season are shown in Table 3 and Figure 3 below. The CNN-2D model definitively outperformed all other models, achieving impressive AUC and AUPRC scores of 0.919 and 0.839 on the holdout test data set, respectively. An AUC greater than 0.9 suggests that the model is able to discriminate between fastballs and off-speed pitches exceptionally well. Further, the model’s prediction capability substantially surpasses that of a naive classifier (i.e., a model that strictly predicts the majority class), achieving an accuracy of 87.1% (compared to a naive classifier’s accuracy of 64.6%). To put this result into perspective, the model was able to accurately predict roughly seven out of every eight pitches.

Table 3. Tyler Glasnow binary pitch classification evaluation, stratified by model type. Reported metrics are mean values from 5-fold cross-validation, whereby the holdout test data set was used for evaluation.
Model          F1-Score (Weighted)   ROC-AUC   AUPRC   MCC     Accuracy   Majority Class %
CNN-2D         0.869                 0.919     0.839   0.714   87.1%      64.6%
CNN-3D         0.820                 0.896     0.809   0.605   82.2%      64.6%
CNN-3D-LSTM    0.823                 0.889     0.810   0.614   82.8%      64.6%
Branched CNN   0.819                 0.877     0.774   0.606   82.2%      64.6%

Figure 3. Binary pitch classification metric visualization for Tyler Glasnow.

Figure 4. Training and validation loss curves for 5-fold CV binary classification models (Tyler Glasnow models).

Figure 4. Training and validation loss curves for 5-fold CV binary classification models (Tyler Glasnow models).

Walker Buehler

Multiclass pitch classification results from Walker Buehler’s 2021 regular season are shown in Table 4 and Figure 5 below. Again, the CNN-2D model definitively outperformed all other models, achieving a weighted F1-score of 0.638 and 63.6% accuracy (compared to a naive classifier’s accuracy of 41.7%). On the surface, the results here are not as compelling as those reported for the binary classification of Tyler Glasnow’s pitches. The results are a bit more interesting, however, when considering class-specific performance metrics such as one-versus-rest AUC. The model achieves one-versus-rest AUC scores of 0.825 and 0.852 for fastballs and off-speed pitches, respectively. This indicates that the model is able to effectively discriminate between both fastballs and non-fastballs and off-speed and non-off-speed pitches. The model’s performance is hampered by an inability to discriminate between fastballs with movement and other pitch types. All things considered, we might expect to see more exciting results if we converted this multiclass prediction task to a binary prediction task.

Table 4. Walker Buehler multiclass pitch classification evaluation, stratified by model type. Reported metrics are mean values from 5-fold cross-validation, whereby the holdout test data set was used for evaluation. Confusion matrices report the culmination of predictions from all 5 CV models.
Confusion matrices (rows: actual, columns: predicted) and performance evaluation

Model          Actual      FB    FBwM   OFF    F1-Score   ROC-AUC (OvR)   AUPRC (OvR)   MCC     Accuracy   Class %
CNN-2D         FB          270   74     31     0.698      0.825           0.760         0.442   63.6%      41.7%
               FBwM        81    93     36     0.426      0.701           0.425                            23.3%
               OFF         48    58     209    0.708      0.852           0.762                            35.0%
               Wtd. Ave.                       0.638      0.807           0.691
CNN-3D         FB          253   68     54     0.666      0.791           0.728         0.371   59.3%      41.7%
               FBwM        71    72     67     0.362      0.679           0.384                            23.3%
               OFF         61    45     209    0.648      0.808           0.690                            35.0%
               Wtd. Ave.                       0.589      0.774           0.641
CNN-3D-LSTM    FB          250   103    22     0.669      0.807           0.723         0.368   58.1%      41.7%
               FBwM        73    94     43     0.382      0.635           0.345                            23.3%
               OFF         50    86     179    0.638      0.818           0.729                            35.0%
               Wtd. Ave.                       0.591      0.767           0.633
Branched CNN   FB          241   70     64     0.581      0.693           0.627         0.230   50.1%      41.7%
               FBwM        113   51     46     0.217      0.604           0.340                            23.3%
               OFF         97    59     159    0.542      0.735           0.611                            35.0%
               Wtd. Ave.                       0.483      0.686           0.538

Figure 5. Multiclass pitch classification metric visualization for Walker Buehler. One-vs-rest metrics are stratified on the x-axis by pitch type. Pitch types include fastball (FB), fastball with movement (FBwM), and off-speed (OFF).

Figure 6. Training and validation loss curves for 5-fold CV multiclass classification models (Walker Buehler models).

Figure 6. Training and validation loss curves for 5-fold CV multiclass classification models (Walker Buehler models).

Discussion

A Statement on Model Performance

To our surprise, the CNN-2D model conclusively outperformed our more complex model architectures. Further investigation will be required to determine the cause. Perhaps we can use this result to adjust our more complex models and achieve even better performance.

Implications

Results from this proof of concept demonstrate the effectiveness of using in-game video footage and deep learning for pitch prediction tasks. We were able to develop a model that is capable of effectively discriminating between pitches using pre-delivery pose information. This tells us that, in some instances (e.g., Tyler Glasnow’s 2019 season), tangible differences exist in a pitcher’s motion or positioning that can help us predict the pitch they are about to throw.

In light of these results, potential applications of this technology are intriguing. Two obvious applications exist in the world of baseball. First, it would be interesting to establish a singular deep learning model architecture that could be trained and tested on any given pitcher. Metrics from a player-specific model could be used to measure that pitcher’s predictability during a certain time span. Pitchers would aim to beat the model and produce poor prediction metrics, which would be indicative of the pitcher’s ability to hide what they are throwing. Second, this technology can be used for outcomes other than pitch classification. For example, organizations could use this methodology to predict pitch quality. This could potentially help interested parties identify either beneficial or detrimental aspects in a pitcher’s pitching mechanics.

Limitations & Next Steps

Despite the promise of these results, we have identified two limitations that should be addressed. First, deep learning approaches require large sample sizes. While our data collection process is capable of gathering large samples of data (e.g., season-long data), organizations might often times be interested in determining a pitcher’s predictability during a short time span. For example, a pitcher might become predictable over the course of a few games. It would be difficult to use a deep learning model to characterize a pitcher’s predictability over such a short time span due to the small sample size. Second, we have yet to explore methods for interpreting classification results. Model interpretability will play a vital role in the usefulness of this technology. We plan to test deep learning interpretation approaches (e.g., Grad-CAM) in future work.